Induction of Root and Pattern Lexicon for Unsupervised Morphological Analysis of Arabic
نویسندگان
چکیده
We propose an unsupervised approach to learning non-concatenative morphology, which we apply to induce a lexicon of Arabic roots and pattern templates. The approach is based on the idea that roots and patterns may be revealed through mutually recursive scoring based on hypothesized pattern and root frequencies. After a further iterative refinement stage, morphological analysis with the induced lexicon achieves a root identification accuracy of over 94%. Our approach differs from previous work on unsupervised learning of Arabic morphology in that it is applicable to naturally-written, unvowelled text.
منابع مشابه
Unsupervised Induction of Arabic Root and Pattern Lexicons using Machine Learning
We describe an approach to building a morphological analyser of Arabic by inducing a lexicon of root and pattern templates from an unannotated corpus. Using maximum entropy modelling, we capture orthographic features from surface words, and cluster the words based on the similarity of their possible roots or patterns. From these clusters, we extract root and pattern lexicons, which allows us to...
متن کاملFixing the Infix: Unsupervised Discovery of Root-and-Pattern Morphology
We present an unsupervised and languageagnostic method for learning root-andpattern morphology in Semitic languages. We harness the syntactico-semantic information in distributed word representations to solve the long standing problem of root-and-pattern discovery in Semitic languages. The root-and-pattern morphological rules we learn in an unsupervised manner are validated by native speakers i...
متن کاملContext-dependent type-level models for unsupervised morpho-syntactic induction
This thesis improves unsupervised methods for part-of-speech (POS) induction and morphological word segmentation by modeling linguistic phenomena previously not used. For both tasks, we realize these linguistic intuitions with Bayesian generative models that first create a latent lexicon before generating unannotated tokens in the input corpus. Our POS induction model explicitly incorporates pr...
متن کاملLinguistically Informed and Corpus Informed Morphological Analysis of Arabic
Standard English PoS-taggers generally involve tag-assignment (via dictionary-lookup etc) followed by tag-disambiguation (via a context model, e.g. PoS-ngrams or Brill transformations). We want to PoS-tag our Arabic Corpus, but evaluation of existing PoStaggers has highlighted shortcomings; in particular, about a quarter of all word tokens are not assigned a fully correct morphological analysis...
متن کاملLearning to Identify Semitic Roots
The standard account of word-formation processes in Semitic languages describes words as combinations of two morphemes: a root and a pattern. The root consists of consonants only, by default three (although longer roots are known), called radicals. The pattern is a combination of vowels and, possibly, consonants too, with ‘slots’ into which the root consonants can be inserted. Words are created...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013